{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "183222bc-eb44-4385-8bf0-2dc55263ec2f",
   "metadata": {},
   "source": [
    "# RAGatouille Retriever Llama Pack \n",
    "\n",
    "RAGatouille is a [cool library](https://github.com/bclavie/RAGatouille) that lets you use e.g. ColBERT and other SOTA retrieval models in your RAG pipeline. You can use it to either run inference on ColBERT, or use it to train/fine-tune models.\n",
    "\n",
    "This LlamaPack shows you an easy way to bundle RAGatouille into your RAG pipeline. We use RAGatouille to index a corpus of documents (by default using colbertv2.0), and then we combine it with LlamaIndex query modules to synthesize an answer with an LLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6051d4b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-llms-openai\n",
    "%pip install llama-index-packs-ragatouille-retriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98a4e842-47fa-4403-a2d2-7047dd2bddea",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Option: if developing with the llama_hub package\n",
    "from llama_index.packs.ragatouille_retriever import RAGatouilleRetrieverPack\n",
    "\n",
    "# Option: download_llama_pack\n",
    "from llama_index.core.llama_pack import download_llama_pack\n",
    "\n",
    "# RAGatouilleRetrieverPack = download_llama_pack(\n",
    "#     \"RAGatouilleRetrieverPack\",\n",
    "#     \"./ragatouille_pack\",\n",
    "#     skip_load=True,\n",
    "#     # leave the below line commented out if using the notebook on main\n",
    "#     # llama_hub_url=\"https://raw.githubusercontent.com/run-llama/llama-hub/jerry/add_llm_compiler_pack/llama_hub\"\n",
    "# )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65bbc8e8-f9e3-4d5c-8c36-309cac65969a",
   "metadata": {},
   "source": [
    "## Load Documents\n",
    "\n",
    "Here we load the ColBERTv2 paper: https://arxiv.org/pdf/2112.01488.pdf."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8cde968-5d4f-42c3-b2c6-dd97b4656901",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2024-01-04 16:02:16--  https://arxiv.org/pdf/2004.12832.pdf\n",
      "Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.67.42, 151.101.3.42, ...\n",
      "Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 4918165 (4.7M) [application/pdf]\n",
      "Saving to: ‘colbertv1.pdf’\n",
      "\n",
      "colbertv1.pdf       100%[===================>]   4.69M  --.-KB/s    in 0.1s    \n",
      "\n",
      "2024-01-04 16:02:16 (34.6 MB/s) - ‘colbertv1.pdf’ saved [4918165/4918165]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget \"https://arxiv.org/pdf/2004.12832.pdf\" -O colbertv1.pdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc41a62d-7d3e-431c-9d8a-08447df14b11",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import SimpleDirectoryReader\n",
    "from llama_index.llms.openai import OpenAI\n",
    "\n",
    "reader = SimpleDirectoryReader(input_files=[\"colbertv1.pdf\"])\n",
    "docs = reader.load_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f6f043b-42cf-4efc-8797-a12c3abd3872",
   "metadata": {},
   "source": [
    "## Create Pack"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43d767af-5f4e-4dbf-9cfe-8df13df3b9e4",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/jerryliu/Programming/llama-hub/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "[Jan 04, 16:02:19] #> Note: Output directory .ragatouille/colbert/indexes/my_index already exists\n",
      "\n",
      "\n",
      "[Jan 04, 16:02:19] #> Will delete 10 files already at .ragatouille/colbert/indexes/my_index in 20 seconds...\n",
      "#> Starting...\n",
      "[Jan 04, 16:02:42] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/jerryliu/Programming/llama-hub/.venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.\n",
      "  warnings.warn(\n",
      "  0%|          | 0/2 [00:00<?, ?it/s]/Users/jerryliu/Programming/llama-hub/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 04, 16:02:43] [0] \t\t #> Encoding 90 passages..\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " 50%|█████     | 1/2 [00:03<00:03,  3.87s/it]/Users/jerryliu/Programming/llama-hub/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n",
      "100%|██████████| 2/2 [00:05<00:00,  2.64s/it]\n",
      "WARNING clustering 14894 points to 1024 centroids: please provide at least 39936 training points\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 04, 16:02:48] [0] \t\t avg_doclen_est = 174.1888885498047 \t len(local_sample) = 90\n",
      "[Jan 04, 16:02:48] [0] \t\t Creating 1,024 partitions.\n",
      "[Jan 04, 16:02:48] [0] \t\t *Estimated* 15,676 embeddings.\n",
      "[Jan 04, 16:02:48] [0] \t\t #> Saving the indexing plan to .ragatouille/colbert/indexes/my_index/plan.json ..\n",
      "Clustering 14894 points in 128D to 1024 clusters, redo 1 times, 20 iterations\n",
      "  Preprocessing in 0.00 s\n",
      "[0.037, 0.037, 0.033, 0.033, 0.033, 0.035, 0.035, 0.035, 0.032, 0.036, 0.032, 0.031, 0.035, 0.036, 0.035, 0.036, 0.034, 0.037, 0.033, 0.034, 0.036, 0.036, 0.035, 0.035, 0.033, 0.036, 0.036, 0.033, 0.037, 0.035, 0.035, 0.037, 0.036, 0.033, 0.037, 0.031, 0.035, 0.036, 0.035, 0.042, 0.037, 0.037, 0.037, 0.036, 0.036, 0.033, 0.034, 0.037, 0.036, 0.032, 0.034, 0.036, 0.038, 0.038, 0.035, 0.034, 0.039, 0.035, 0.036, 0.034, 0.035, 0.038, 0.035, 0.037, 0.035, 0.036, 0.04, 0.033, 0.034, 0.034, 0.038, 0.034, 0.038, 0.036, 0.038, 0.035, 0.037, 0.04, 0.036, 0.04, 0.037, 0.037, 0.037, 0.037, 0.034, 0.036, 0.034, 0.037, 0.032, 0.039, 0.037, 0.036, 0.034, 0.038, 0.035, 0.033, 0.039, 0.036, 0.035, 0.035, 0.039, 0.038, 0.034, 0.035, 0.037, 0.033, 0.033, 0.031, 0.035, 0.035, 0.035, 0.038, 0.036, 0.033, 0.035, 0.035, 0.038, 0.035, 0.035, 0.036, 0.036, 0.039, 0.036, 0.039, 0.034, 0.038, 0.038, 0.034]\n",
      "[Jan 04, 16:02:48] [0] \t\t #> Encoding 90 passages..\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0it [00:00, ?it/s]\n",
      "  0%|          | 0/2 [00:00<?, ?it/s]\u001b[A\n",
      " 50%|█████     | 1/2 [00:03<00:03,  3.32s/it]\u001b[A\n",
      "100%|██████████| 2/2 [00:04<00:00,  2.34s/it]\u001b[A\n",
      "1it [00:04,  4.72s/it]\n",
      "100%|██████████| 1/1 [00:00<00:00, 5322.72it/s]\n",
      "100%|██████████| 1024/1024 [00:00<00:00, 331171.82it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 04, 16:02:53] #> Optimizing IVF to store map from centroids to list of pids..\n",
      "[Jan 04, 16:02:53] #> Building the emb2pid mapping..\n",
      "[Jan 04, 16:02:53] len(emb2pid) = 15677\n",
      "[Jan 04, 16:02:53] #> Saved optimized IVF to .ragatouille/colbert/indexes/my_index/ivf.pid.pt\n",
      "\n",
      "#> Joined...\n",
      "Done indexing!\n"
     ]
    }
   ],
   "source": [
    "index_name = \"my_index\"\n",
    "ragatouille_pack = RAGatouilleRetrieverPack(\n",
    "    docs, llm=OpenAI(model=\"gpt-3.5-turbo\"), index_name=index_name, top_k=5\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad909df3-8fbb-4316-881c-ccfaeb947d51",
   "metadata": {},
   "source": [
    "## Try out Pack\n",
    "\n",
    "We try out both the individual modules in the pack as well as running it e2e! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ebe26f99-5cbb-42de-8dcb-b173b1668d1a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "New index_name received! Updating current index_name (my_index) to my_index\n",
      "Loading searcher for index my_index for the first time... This may take a few seconds\n",
      "[Jan 04, 16:02:55] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n",
      "[Jan 04, 16:02:56] #> Loading codec...\n",
      "[Jan 04, 16:02:56] #> Loading IVF...\n",
      "[Jan 04, 16:02:56] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/jerryliu/Programming/llama-hub/.venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 04, 16:02:56] #> Loading doclens...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5555.37it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 04, 16:02:56] #> Loading codes and residuals...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 521.10it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 04, 16:02:56] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 04, 16:02:56] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n",
      "Searcher loaded!\n",
      "\n",
      "#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==\n",
      "#> Input: . How does ColBERTv2 compare with SPLADEv2?, \t\t True, \t\t None\n",
      "#> Output IDs: torch.Size([32]), tensor([  101,     1,  2129,  2515, 23928,  2615,  2475, 12826,  2007, 11867,\n",
      "        27266,  6777,  2475,  1029,   102,   103,   103,   103,   103,   103,\n",
      "          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,\n",
      "          103,   103])\n",
      "#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
      "        0, 0, 0, 0, 0, 0, 0, 0])\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/jerryliu/Programming/llama-hub/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 5e4028f7-fbb5-4440-abd0-0d8270cc8979<br>**Similarity:** 17.003997802734375<br>**Text:** While highly competitive in eﬀec-\n",
       "tiveness, ColBERT is orders of magnitude cheaper than BERT base...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** d6240a29-0a5e-458f-86f1-abe570e13200<br>**Similarity:** 16.764663696289062<br>**Text:** Note that any BERT-based model\n",
       "must incur the computational cost of processing each document\n",
       "at l...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** d19c0fe7-bdb7-4a51-ae89-00cd746b2d3a<br>**Similarity:** 16.70589828491211<br>**Text:** For instance,\n",
       "its Recall@50 actually exceeds the oﬃcial BM25’s Recall@1000 and\n",
       "even all but docTT...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 38e84e5b-4345-4b08-a7fd-de2de4fa645a<br>**Similarity:** 16.577777862548828<br>**Text:** /T_his layer serves to control the dimension\n",
       "of ColBERT’s embeddings, producing m-dimensional emb...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** c82df506-412a-40c2-baf3-df51ab43e434<br>**Similarity:** 16.252092361450195<br>**Text:** For instance, at k=10, BERT requires nearly\n",
       "180\u0002more FLOPs than ColBERT; at k=1000, BERT’s overhe...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from llama_index.core.response.notebook_utils import display_source_node\n",
    "\n",
    "retriever = ragatouille_pack.get_modules()[\"retriever\"]\n",
    "nodes = retriever.retrieve(\"How does ColBERTv2 compare with other BERT models?\")\n",
    "\n",
    "for node in nodes:\n",
    "    display_source_node(node)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b206fab9-a980-44c8-8e76-7e340f7d08eb",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/jerryliu/Programming/llama-hub/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'content': 'While highly competitive in eﬀec-\\ntiveness, ColBERT is orders of magnitude cheaper than BERT base,\\nin particular, by over 170 \\x02in latency and 13,900 \\x02in FLOPs. /T_his\\nhighlights the expressiveness of our proposed late interaction mech-\\nanism, particularly when coupled with a powerful pre-trained LM\\nlike BERT. While ColBERT’s re-ranking latency is slightly higher\\nthan the non-BERT re-ranking models shown (i.e., by 10s of mil-\\nliseconds), this diﬀerence is explained by the time it takes to gather,\\nstack, and transfer the document embeddings to the GPU. In partic-\\nular, the query encoding and interaction in ColBERT consume only\\n13 milliseconds of its total execution time. We note that ColBERT’s\\nlatency and FLOPs can be considerably reduced by padding queries\\nto a shorter length, using smaller vector dimensions (the MRR@10\\nof which is tested in §4.5), employing quantization of the document\\n6h/t_tps://github.com/mit-han-lab/torchpro/f_ile',\n",
       "  'score': 17.003997802734375,\n",
       "  'rank': 1},\n",
       " {'content': 'Note that any BERT-based model\\nmust incur the computational cost of processing each document\\nat least once. While ColBERT encodes each document with BERT\\nexactly once, existing BERT-based rankers would repeat similar\\ncomputations on possibly hundreds of documents for each query.\\nSe/t_ting Dimension( m) Bytes/Dim Space(GiBs) MRR@10\\nRe-rank Cosine 128 4 286 34.9\\nEnd-to-end L2 128 2 154 36.0\\nRe-rank L2 128 2 143 34.8\\nRe-rank Cosine 48 4 54 34.4\\nRe-rank Cosine 24 2 27 33.9\\nTable 4: Space Footprint vs MRR@10 (Dev) on MS MARCO.\\nTable 4 reports the space footprint of ColBERT under various\\nse/t_tings as we reduce the embeddings dimension and/or the bytes\\nper dimension.',\n",
       "  'score': 16.764663696289062,\n",
       "  'rank': 2},\n",
       " {'content': 'For instance,\\nits Recall@50 actually exceeds the oﬃcial BM25’s Recall@1000 and\\neven all but docTTTTTquery’s Recall@200, emphasizing the value\\nof end-to-end retrieval (instead of just re-ranking) with ColBERT.\\n4.4 Ablation Studies\\n0.220.240.260.280.300.320.340.36\\nMRR@10BERT [CLS]-based dot-product (5-layer)  [A]\\nColBERT via average similarity (5-layer)  [B]\\nColBERT without query augmentation (5-layer)  [C]\\nColBERT (5-layer)  [D]\\nColBERT (12-layer)  [E]\\nColBERT + e2e retrieval (12-layer)  [F]\\nFigure 5: Ablation results on MS MARCO (Dev). Between\\nbrackets is the number of BERT layers used in each model.\\n/T_he results from §4.2 indicate that ColBERT is highly eﬀective\\ndespite the low cost and simplicity of its late interaction mechanism.',\n",
       "  'score': 16.70589828491211,\n",
       "  'rank': 3},\n",
       " {'content': '/T_his layer serves to control the dimension\\nof ColBERT’s embeddings, producing m-dimensional embeddings\\nfor the layer’s output size m. As we discuss later in more detail,\\nwe typically /f_ix mto be much smaller than BERT’s /f_ixed hidden\\ndimension.\\nWhile ColBERT’s embedding dimension has limited impact on\\nthe eﬃciency of query encoding, this step is crucial for controlling\\nthe space footprint of documents, as we show in §4.5. In addition, it\\ncan have a signi/f_icant impact on query execution time, particularly\\nthe time taken for transferring the document representations onto\\nthe GPU from system memory (where they reside before processing\\na query). In fact, as we show in §4.2, gathering, stacking, and\\ntransferring the embeddings from CPU to GPU can be the most\\nexpensive step in re-ranking with ColBERT. Finally, the output\\nembeddings are normalized so each has L2 norm equal to one.\\n/T_he result is that the dot-product of any two embeddings becomes\\nequivalent to their cosine similarity, falling in the »\\x001;1¼range.\\nDocument Encoder.',\n",
       "  'score': 16.577777862548828,\n",
       "  'rank': 4}]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# try out the RAG module directly\n",
    "RAG = ragatouille_pack.get_modules()[\"RAG\"]\n",
    "results = RAG.search(\n",
    "    \"How does ColBERTv2 compare with other BERT models?\", index_name=index_name, k=4\n",
    ")\n",
    "results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "985406f6-beb5-49cc-b1b7-6943c4e91201",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ColBERTv2, which employs late interaction over BERT base, performs no worse than the original adaptation of BERT base for ranking. It is only marginally less effective than BERT large and our training of BERT base. While highly competitive in effectiveness, ColBERTv2 is orders of magnitude cheaper than BERT base, particularly in terms of latency and FLOPs.\n"
     ]
    }
   ],
   "source": [
    "# run pack e2e, which includes the full query engine with OpenAI LLMs\n",
    "response = ragatouille_pack.run(\"How does ColBERTv2 compare with other BERT models?\")\n",
    "print(str(response))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_hub",
   "language": "python",
   "name": "llama_hub"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
